Abstract 

Reducing the amount of human supervision is a key problem in machine learning,
and a natural approach is to exploit the relations (structure) among
different tasks. This is the idea at the core of multi-task learning. In this context,
a fundamental question is how to incorporate the task structure in the learning
problem. We tackle this question by studying a general computational framework
that allows a-priori knowledge of the task structure to be encoded in the form of a
convex penalty; in this setting a variety of previously proposed methods can be
recovered as special cases, including linear and non-linear approaches. Within this
framework, we show that tasks and their structure can be efficiently learned by
solving a convex optimization problem that can be approached by means of block
coordinate methods such as alternating minimization, and for which we prove
convergence to the global minimum. 


1 Introduction 

Current machine learning systems achieve remarkable results in several challenging
tasks, but are limited by the amount of human supervision required. Leveraging
similarity among different problems is widely acknowledged to be a key approach to reduce
the need for supervised data. Indeed, this idea is at the basis of multi-task learning,
where the joint solution of different problems (tasks) has the potential to exploit task
relatedness (structure) to improve learning accuracy. This idea has motivated a variety
of methods, including frequentist [?, 3, 14] and Bayesian methods (see e.g. [?] and
references therein), with connections to structured learning [?]. 
The focus of our study is the development of a general regularization framework to
learn multiple tasks as well as their structure. Following [25, 15], we consider a
setting where tasks are modeled as the components of a vector-valued function and their
structure corresponds to the choice of suitable functional spaces. Exploiting the
theory of reproducing kernel Hilbert spaces for vector-valued functions (RKHSvv) [25],
we consider and analyze a flexible regularization framework, within which a
variety of previously proposed approaches can be recovered as special cases, see e.g.
[?, 26, 37, ?]. Our main technical contribution is a unifying study of the
minimization problem corresponding to such a regularization framework. More
precisely, we devise an optimization approach that can efficiently compute a solution and
for which we prove convergence under weak assumptions. Our approach is based on
a barrier method that is combined with block coordinate descent techniques [?, 30].
In this sense our analysis generalizes the results in [3], for which a low-rank
assumption was considered; however the extension is not straightforward, since we consider
a much larger class of regularization schemes (any convex penalty). To our
knowledge, this is the first result in multi-task learning proving the convergence of alternating
minimization schemes for such a general family of problems. 

The RKHSvv setting naturally handles both linear and non-linear models,
and the approach we propose provides a general computational framework for learning
output kernels as formalized in [14]. 

The rest of the paper is organized as follows: in Sec. 2 we review basic ideas of
regularization in RKHSvv. In Sec. 2.3 we discuss the equivalence of different approaches to
encode known structures among multiple tasks. In Sec. 3 we discuss a general
framework for learning multiple tasks and their relations, where we consider a wide family of
structure-inducing penalties and study an optimization strategy to solve them. This
setting allows us, in Sec. 4, to recover several previous methods as special cases. Finally,
in Sec. 5 we evaluate the performance of the proposed optimization method. 

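Notation. With $S^n_{++} \subset S^n_{+} \subset S^n \subset \mathbb{R}^{n \times n}$ we denote respectively the space of
positive definite, positive semidefinite (PSD) and symmetric $n \times n$ real-valued matrices.
$O^n$ denotes the space of orthonormal $n \times n$ matrices. For any square matrix $M \in \mathbb{R}^{n \times n}$
and $p \ge 1$, we denote by $\|M\|_p = \left(\sum_{i=1}^{n} \sigma_i(M)^p\right)^{1/p}$ the p-Schatten norm of $M$,
where $\sigma_i(M)$ is the $i$-th largest singular value of $M$. For any $M \in \mathbb{R}^{n \times m}$, $M^\top$ denotes
the transpose of $M$. For any PSD matrix $A \in S^n_{+}$, $A^\dagger$ denotes the pseudoinverse of
$A$. We denote by $I_n \in S^n_{++}$ the $n \times n$ identity matrix. The notation $\mathrm{Ran}(M) \subseteq \mathbb{R}^m$
identifies the range of columns of a matrix $M \in \mathbb{R}^{m \times n}$. 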
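
As a concrete illustration of the p-Schatten norm defined above, the following minimal NumPy sketch computes it from the singular values. The function name `schatten_norm` and the example matrix are illustrative assumptions, not part of the original notation.

```python
import numpy as np

def schatten_norm(M, p):
    """p-Schatten norm: the l_p norm of the vector of singular values of M."""
    if p < 1:
        raise ValueError("p must be >= 1")
    sigma = np.linalg.svd(M, compute_uv=False)  # singular values, in descending order
    return float(np.sum(sigma ** p) ** (1.0 / p))

A = np.array([[2.0, 0.0], [0.0, 1.0]])  # PSD matrix with singular values 2 and 1
trace_norm = schatten_norm(A, 1)        # for a PSD matrix this equals tr(A)
frob_norm = schatten_norm(A, 2)         # this equals the Frobenius norm
```

For $p = 1$ and $p = 2$ the sketch recovers the trace norm and the Frobenius norm, the two special cases that appear later in the paper.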

2 Background 

We study the problem of jointly learning multiple tasks by modeling individual task
predictors as the components of a vector-valued function. Assume we are given $T$
supervised scalar learning problems (or tasks), each with a “training” set of input-output
observations $S_t = \{(x_{it}, y_{it})\}_{i=1}^{n_t}$, with $x_{it} \in \mathcal{X}$ the input space and $y_{it} \in \mathcal{Y}$ the output
space*. Given a loss function $\mathcal{L} : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ that measures the per-task prediction
errors, we want to solve the following joint regularized learning problem

$$\operatorname*{minimize}_{f \in \mathcal{H}} \ \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}(y_{it}, f_t(x_{it})) + \lambda \|f\|_{\mathcal{H}}^2 \qquad (1)$$

where $\mathcal{H}$ is a Hilbert space of vector-valued functions $f : \mathcal{X} \to \mathcal{Y}^T$ with scalar
components $f_t : \mathcal{X} \to \mathcal{Y}$. In order to define a suitable space of hypotheses $\mathcal{H}$, in this
section we briefly recall concepts from the theory of reproducing kernel Hilbert spaces
for vector-valued functions (RKHSvv) and the corresponding regularization theory, which
plays a key role in our work. In particular, we focus on a class of reproducing kernels
(known as separable kernels) that can be designed to encode specific task structures
(see [15, ?] and Sec. 2.3). Interestingly, separable kernels are related to ideas such as
defining a metric on the output space or a label encoding in multi-label problems (see
Sec. 2.3).

*To avoid clutter in the notation, we have restricted ourselves to the typical situation where all tasks share
the same input and output spaces, i.e. $\mathcal{X}_t = \mathcal{X}$ and $\mathcal{Y}_t = \mathcal{Y} \subseteq \mathbb{R}$. 

Remark 2.1 (Multi-task and multi-label learning). Multi-label learning is a class of
supervised learning problems in which the goal is to associate input examples with
a label or a set of labels chosen from a discrete set. In general, due to the discrete
nature of the output space, these problems cannot be solved directly; hence, a so-called
surrogate problem is often introduced, which is computationally tractable and whose
solution allows the solution of the original problem to be recovered [?, 17, 28].

Multi-label learning and multi-task learning are strongly related. Indeed, surrogate
problems typically consist of a set of distinct supervised learning problems (or tasks)
that are solved simultaneously and therefore have a natural formulation in the
multi-task setting. For instance, in multi-class classification problems the “One vs All”
strategy is often adopted, which consists in solving a set of binary classification
problems, one for each class. 
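
The "One vs All" reduction mentioned in Remark 2.1 can be sketched as a label encoding that turns one multi-class problem into $T$ binary tasks. The ±1 coding, the argmax decoding rule and the function names below are common conventions assumed for illustration; the paper does not prescribe them.

```python
import numpy as np

def one_vs_all_targets(labels, num_classes):
    """Return an (n, T) matrix with +1 in the column of the true class and -1 elsewhere."""
    labels = np.asarray(labels)
    Y = -np.ones((labels.shape[0], num_classes))
    Y[np.arange(labels.shape[0]), labels] = 1.0
    return Y

def decode(scores):
    """Predict the class whose binary predictor returns the largest score."""
    return np.argmax(scores, axis=1)

labels = np.array([0, 2, 1, 2])
Y = one_vs_all_targets(labels, num_classes=3)  # each column is one binary task
```

Each column of `Y` is the output vector of one task, so the matrix can play the role of the output matrix used in the multi-task formulation.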

2.1 Learning Multiple Tasks with RKHSvv 

In the scalar setting, reproducing kernel Hilbert spaces have already proved to be
a powerful tool for machine learning applications. Interestingly, the theory of RKHSvv
and the corresponding Tikhonov regularization scheme closely follow the derivation in the
scalar case.

Definition 2.2. Let $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ be a Hilbert space of functions from $\mathcal{X}$ to $\mathbb{R}^T$. A
symmetric, positive definite, matrix-valued function $\Gamma : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{T \times T}$ is called a
reproducing kernel for $\mathcal{H}$ if for all $x \in \mathcal{X}$, $c \in \mathbb{R}^T$ and $f \in \mathcal{H}$ we have that $\Gamma(x, \cdot)c \in \mathcal{H}$
and the following reproducing property holds: $\langle f(x), c \rangle_{\mathbb{R}^T} = \langle f, \Gamma(x, \cdot)c \rangle_{\mathcal{H}}$. 

In analogy to the scalar setting, it can be proved (see [25]) that the Representer
Theorem also holds for regularization in RKHSvv. In particular, any
solution of the learning problem introduced in Eq. (1) can be written in the form

$$f(\cdot) = \sum_{t=1}^{T} \sum_{i=1}^{n_t} \Gamma(\cdot, x_{it})\, c_{it} \qquad (2)$$

with $c_{it} \in \mathbb{R}^T$ coefficient vectors. 

The choice of kernel $\Gamma$ induces a joint representation of the inputs as well as a structure
among the output components [?]. In the rest of the paper we will focus on so-called
separable kernels, where these two aspects are factorized. In Section 3 we will see
how separable kernels provide a natural way to learn the task structure as well as the
tasks. 

2.2 Separable Kernels 

Separable (reproducing) kernels are functions of the form $\Gamma(x, x') = k(x, x')A$ for all $x, x' \in \mathcal{X}$,
where $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a scalar reproducing kernel and $A \in S^T_{+}$ is a positive
semi-definite (PSD) matrix. In this case, the representer theorem allows problem (1) to be
rewritten in a more compact matrix notation as

$$\operatorname*{minimize}_{C \in \mathbb{R}^{n \times T}} \ V(Y, KCA) + \lambda\, \mathrm{tr}(A C^\top K C). \qquad (P)$$

Here $Y \in \mathbb{R}^{n \times T}$ is a matrix with $n = \sum_{t=1}^{T} n_t$ rows containing the output points;
$K \in S^n_{+}$ is the empirical kernel matrix associated to $k$, and $V : \mathbb{R}^{n \times T} \times \mathbb{R}^{n \times T} \to \mathbb{R}_+$
generalizes the loss in (1) and consists of a linear combination of the entry-wise
application of $\mathcal{L}$. Notice that this formulation also accounts for the situation where not all
training outputs are observed when a given input $x \in \mathcal{X}$ is provided; in this case
the functional $V$ assigns weight 0 to the loss values of those entries of $Y$ (and the associated
entries of $KCA$) that are not available in training.

Finally, the second term in (P) follows by observing that, for all $f \in \mathcal{H}$ of the form
$f(\cdot) = \sum_{i=1}^{n} k(x_i, \cdot) A c_i$, the squared norm can be written as
$\|f\|_{\mathcal{H}}^2 = \sum_{i,j=1}^{n} k(x_i, x_j)\, c_i^\top A c_j = \mathrm{tr}(A C^\top K C)$,
where $C \in \mathbb{R}^{n \times T}$ is the matrix whose $i$-th row corresponds to the
coefficient vector $c_i \in \mathbb{R}^T$ of $f$. Notice that we have re-ordered the index $i$ to be in
$\{1, \dots, n\}$ to ease the notation. 
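
The identity between the double sum over kernel evaluations and the compact trace form of the regularizer can be checked numerically. The sketch below assumes a linear scalar kernel and random data purely for illustration; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, d = 6, 3, 4
X = rng.standard_normal((n, d))
K = X @ X.T                      # empirical kernel matrix of the linear kernel k(x, x') = <x, x'>
B = rng.standard_normal((T, T))
A = B @ B.T                      # a PSD task-structure matrix
C = rng.standard_normal((n, T))  # row i is the coefficient vector c_i

# Double sum: sum_{i,j} k(x_i, x_j) c_i^T A c_j
double_sum = sum(K[i, j] * (C[i] @ A @ C[j]) for i in range(n) for j in range(n))
# Compact form: tr(A C^T K C)
trace_form = np.trace(A @ C.T @ K @ C)
# Predictions of all tasks on the training inputs: row i is f(x_i) = sum_j k(x_i, x_j) A c_j
F_train = K @ C @ A
```

The matrix `F_train` is the quantity $KCA$ that enters the loss term, which is why the whole problem can be written in terms of $K$, $C$ and $A$ only.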

2.3 Incorporating Known Tasks Structure 

Separable kernels provide a natural way to incorporate the task structure when the
latter is known a priori. This strategy is quite general: in the following we
comment on how the matrix $A$ can be chosen to recover several multi-task methods
previously proposed in contexts such as regularization, coding/embeddings or output
metric learning, postponing a more detailed discussion to the supplementary material.
These observations motivate the extension in Sec. 3 of the learning problem to a
setting where it is possible to infer $A$ from the data. 

Regularizers. Task relations can be enforced by devising suitable regularizers [15].
Interestingly, for a large class of such methods it can be shown that this is equivalent to
the choice of the matrix $A$ (or rather its pseudoinverse) [25]. If we consider the squared
norm of a function $f = \sum_{i=1}^{n} k(x_i, \cdot) A c_i \in \mathcal{H}$ we have (see [15])

$$\|f\|_{\mathcal{H}}^2 = \sum_{t,s=1}^{T} A^\dagger_{ts}\, \langle f_t, f_s \rangle_{\mathcal{H}_k} \qquad (3)$$

where $A_t$ is the $t$-th column of $A$, $\mathcal{H}_k$ is the RKHS associated to the scalar kernel $k$ and
$f_t = \sum_{i=1}^{n} k(x_i, \cdot) A_t^\top c_i \in \mathcal{H}_k$ is the $t$-th component of $f$. The above equation
suggests interpreting $A^\dagger$ as the matrix that models the structural relations between tasks by
directly coupling different predictors. For instance, by setting $A^\dagger = I_T + \gamma (1 1^\top)/T$,
with $1 \in \mathbb{R}^T$ the vector of all 1s, the parameter $\gamma$ controls the variance
$\sum_{t=1}^{T} \|\bar{f} - f_t\|_{\mathcal{H}_k}^2$ of the tasks with respect to their mean $\bar{f} = \frac{1}{T} \sum_{t=1}^{T} f_t$. If we have
access to some notion of similarity among tasks in the form of a graph with adjacency
matrix $W \in S^T$, we can consider the regularizer
$\frac{1}{2} \sum_{t,s=1}^{T} W_{t,s} \|f_t - f_s\|_{\mathcal{H}_k}^2 + \gamma \sum_{t=1}^{T} \|f_t\|_{\mathcal{H}_k}^2$,
which corresponds to $A^\dagger = L + \gamma I_T$ with $L$ the graph Laplacian induced by $W$. 
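
The correspondence between a graph-based regularizer and the choice $A^\dagger = L + \gamma I_T$ can be verified numerically. The sketch below makes several assumptions for illustration: predictors are linear (each $f_t$ is a vector, so inner products in $\mathcal{H}_k$ become Euclidean), the pairwise term carries a factor 1/2, and the Laplacian is $L = D - W$ with $D$ the diagonal degree matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, gamma = 4, 5, 0.3
W = rng.random((T, T))
W = (W + W.T) / 2.0              # symmetric task-similarity weights
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W   # graph Laplacian L = D - W
F = rng.standard_normal((T, d))  # row t plays the role of the predictor f_t

# Pairwise form: 1/2 sum_{t,s} W_ts ||f_t - f_s||^2 + gamma sum_t ||f_t||^2
pairwise = 0.5 * sum(W[t, s] * np.sum((F[t] - F[s]) ** 2)
                     for t in range(T) for s in range(T))
regularizer = pairwise + gamma * np.sum(F ** 2)

# Coupled form: sum_{t,s} (A^dagger)_ts <f_t, f_s> with A^dagger = L + gamma I
A_dagger = L + gamma * np.eye(T)
G = F @ F.T                      # Gram matrix of inner products <f_t, f_s>
coupled_form = np.sum(A_dagger * G)
```

Both expressions return the same value, which illustrates how a regularizer coupling pairs of tasks translates into a specific structure matrix.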

Output Metric. A different approach to model task relatedness consists in
choosing a suitable metric on the output space to reflect the task structure [24]. Clearly
a change of metric on the output space with the standard inner product $\langle y, y' \rangle_{\mathbb{R}^T}$
between two output points $y, y' \in \mathcal{Y}^T$ corresponds to the choice of a different inner
product $\langle y, y' \rangle_{\Theta} = \langle y, \Theta y' \rangle_{\mathbb{R}^T}$ for some positive definite matrix $\Theta \in S^T_{++}$. Indeed
this can be directly related to the choice of a suitable separable kernel. In particular,
for the least squares loss function a direct equivalence holds between choosing a
metric deformation associated to a $\Theta \in S^T_{++}$ and a separable kernel $k(\cdot, \cdot) I_T$, or using the
canonical metric (i.e. with $\Theta = I_T$ the identity) and kernel $k(\cdot, \cdot)\Theta$. The details of this
equivalence can be found in the supplementary material. 

Output Representation. The task structure can also be modeled by designing an
ad-hoc embedding for the output space. This approach is particularly useful for
multi-label scenarios, where the output embedding can be designed to encode complex structures
(e.g. trees, strings, graphs, etc.) [?]. Interestingly in these cases, or
more generally whenever the embedding map $L : \mathcal{Y}^T \to \tilde{\mathcal{Y}}$, from the original to the
new output space, is linear, it is possible to show that the learning problem with the
new code is equivalent to (P) for a suitable choice of separable kernel with $A = L^\top L$.
We refer again to the supplementary material for the details of this equivalence. 

3 Learning the Tasks and their Structure 

Clearly, an interesting setting occurs when knowledge of the task structure is not
available and therefore it is not possible to design a suitable separable kernel. In this case a
favorable approach is to infer the task relations directly from the data. To this end we
propose to consider the following extension of problem (P):

$$\operatorname*{minimize}_{C \in \mathbb{R}^{n \times T},\ A \in S^T_{+}} \ V(Y, KCA) + \lambda\, \mathrm{tr}(A C^\top K C) + F(A), \qquad (Q)$$

where the penalty $F : S^T_{+} \to \mathbb{R}_+$ is designed to learn specific task structures
encoded in the matrix $A$. The above regularization is general enough to encompass a
large number of previously proposed approaches by simply specifying a choice of the
scalar kernel and the penalty $F$. A detailed discussion of these connections is
postponed to Section 4. In this section, we focus on computational aspects. Throughout,
we restrict ourselves to convex loss functions $V$ and convex (and coercive) penalties
$F$. In this case, the objective function in (Q) is separately convex in $C$ and $A$ but not
jointly convex. Hence, block coordinate methods, which are often used in practice,
e.g. alternating minimization over $C$ and $A$, are not guaranteed to converge to a global
minimum. Our study provides a general framework to provably compute a solution
to problem (Q). First, in Section 3.1 we prove our main results, providing a
characterization of the solutions of Problem (Q) and studying a barrier method to cast their
computation as a convex optimization problem. Second, in Section 3.2 we discuss
how block coordinate methods can be naturally used to solve such a problem, analyze
their convergence properties and discuss some general cases of interest. 

3.1 Characterization of Minima and A Barrier Method 

We begin, in Section 3.1.1, by characterizing the solutions to Problem
(Q), showing that it has an equivalent formulation in terms of the minimization of
a convex objective function, namely Problem (R). Depending on the behavior of the
objective function on the boundary of the optimization domain, Problem (R) might not
be solvable using standard optimization techniques. This possible issue motivates the
introduction, in Section 3.1.2, of a barrier method: a family of “perturbed” convex
programs is introduced whose solutions are shown to converge to those of Problem (R)
(and hence of the original (Q)). 

3.1.1 An Equivalent Formulation for (Q)

The objective functional in (Q) is not convex, therefore in principle it is hard to find a
global minimizer. As it turns out, however, it is possible to circumvent this issue and
efficiently find a global solution to (Q). The following result represents a first step in
this direction. 

Theorem 3.1. Let $K \in S^n_{+}$ and consider the convex set

$$\mathcal{C} = \{ (C, A) \in \mathbb{R}^{n \times T} \times S^T_{+} \ | \ \mathrm{Ran}(C^\top K C) \subseteq \mathrm{Ran}(A) \}.$$

Then, for any $F : S^T_{+} \to \mathbb{R}_+$ convex and coercive, the problem

$$\operatorname*{minimize}_{(C, A) \in \mathcal{C}} \ V(Y, KC) + \lambda\, \mathrm{tr}(A^\dagger C^\top K C) + F(A) \qquad (R)$$

has a convex objective function and is equivalent to (Q). In particular, the two
problems achieve the same minimum value and, given a solution $(C_R, A_R)$ for (R), the
couple $(C_R A_R^\dagger, A_R)$ is a minimizer for (Q). Vice-versa, given a solution $(C_Q, A_Q)$ for
(Q), the couple $(C_Q A_Q, A_Q)$ is a minimizer for (R). 

The above result highlights a remarkable connection between the problems (Q)
(non-convex) and (R) (convex). In particular, we have the following Corollary, which
provides us with a useful characterization of the local minimizers of problem (Q).

Corollary 3.2. Let $Q : \mathbb{R}^{n \times T} \times S^T_{+} \to \mathbb{R}$ be the objective function of problem (Q).
Then, every local minimizer for $Q$ on the open set $\mathbb{R}^{n \times T} \times S^T_{++}$ is also a global
minimizer.

Corollary 3.2 follows from Theorem 3.1 and the fact that, on the restricted domain
$\mathbb{R}^{n \times T} \times S^T_{++}$, the map $Q$ is the composition of the objective functional of (R) and
the invertible function $(C, A) \mapsto (CA, A)$. Moreover, if $Q$ is differentiable, i.e. $V$
and the penalty $F$ are differentiable, this is exactly the definition of a convexifiable
function, which in particular implies invexity [?]. The latter property ensures that, in
the differentiable case, all the stationary points (rather than only local minimizers) are
global minimizers. This result was originally proved in [14] for the special case of $V$
the least-squares loss and $F(\cdot) = \|\cdot\|_F^2$ the squared Frobenius norm; here we have proved its
generalization to all convex losses $V$ and penalties $F$. 
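
The two facts behind Theorem 3.1 and Corollary 3.2 can be illustrated numerically on the regularization terms alone. The sketch below is a sanity check under illustrative assumptions (random positive definite matrices, hypothetical function names), not a proof: it shows that the change of variables $(C, A) \mapsto (CA, A)$ maps the term of (Q) onto the term of (R), that the term of (R) passes a midpoint convexity test, and that the term of (Q) fails one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 3

def random_pd(size):
    """Random symmetric positive definite matrix."""
    B = rng.standard_normal((size, size))
    return B @ B.T + 0.1 * np.eye(size)

K = random_pd(n)

def q_term(C, A):
    """Regularizer of the non-convex problem: tr(A C^T K C)."""
    return np.trace(A @ C.T @ K @ C)

def r_term(C, A):
    """Regularizer of the convex problem for invertible A: tr(A^{-1} C^T K C)."""
    return np.trace(np.linalg.solve(A, C.T @ K @ C))

C0, A0 = rng.standard_normal((n, T)), random_pd(T)
C1, A1 = rng.standard_normal((n, T)), random_pd(T)

# The change of variables (C, A) -> (C A, A) maps one regularizer onto the other.
same_value = np.isclose(q_term(C0, A0), r_term(C0 @ A0, A0))

# Midpoint test for the convex term: value at the midpoint minus average of the endpoints.
gap_r = r_term((C0 + C1) / 2, (A0 + A1) / 2) - 0.5 * (r_term(C0, A0) + r_term(C1, A1))

# The same test fails for the non-convex term on the pair (C0, 0) and (0, 2I).
zero_C, zero_A = np.zeros((n, T)), np.zeros((T, T))
gap_q = q_term(C0 / 2, np.eye(T)) - 0.5 * (q_term(C0, zero_A) + q_term(zero_C, 2 * np.eye(T)))
```

A non-positive `gap_r` is consistent with joint convexity in $(C, A)$, while a strictly positive `gap_q` exhibits a pair of points at which the term of (Q) violates joint convexity.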

We end this section by adding two comments. First, we note that, while the objective
function in Problem (R) is convex, the corresponding minimization problem might not
be a convex program (in the sense that the feasible set $\mathcal{C}$ is not identified by a set of
linear equalities and non-linear convex inequalities [9]). Second, Corollary 3.2 holds
only on the interior of the minimization domain $\mathbb{R}^{n \times T} \times S^T_{+}$ and does not characterize
the behavior of the target functional on its boundary. In fact, one can see that both issues
can be tackled by defining a perturbed objective functional having a suitable behavior on
the boundary of the minimization domain. This is the key motivation for the barrier
method we discuss in the next section. 

3.1.2 A Barrier Method to Optimize (R)

Here we propose a barrier approach inspired by the work in [3], introducing a
perturbation of problem (R) that forces the objective function to be equal to $+\infty$ on the
boundary of $\mathbb{R}^{n \times T} \times S^T_{+}$. As a consequence, each perturbed problem can be solved as
a convex optimization problem constrained on a closed cone. The latter comment is made more
precise in the following result, which we prove in the supplementary material. 

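Theorem 3.3. Consider the family of optimization problems

$$\operatorname*{minimize}_{C \in \mathbb{R}^{n \times T},\ A \in S^T_{++}} \ V(Y, KC) + \lambda\, \mathrm{tr}\!\left(A^{-1} (C^\top K C + \delta^2 I_T)\right) + F(A) \qquad (S^\delta)$$

with $I_T \in S^T_{++}$ the identity matrix. Then, for each $\delta > 0$ the problem $(S^\delta)$
admits a minimum. Furthermore, the set of minimizers for $(S^\delta)$ converges to the set of
minimizers for (R) as $\delta$ tends to zero. More precisely, given any sequence $\delta_m > 0$
such that $\delta_m \to 0$ and a sequence of minimizers $(C_m, A_m) \in \mathbb{R}^{n \times T} \times S^T_{++}$ for $(S^{\delta_m})$,
there exists a sequence $(C^*_m, A^*_m) \in \mathbb{R}^{n \times T} \times S^T_{+}$ of minimizers for (R) such that
$\|C_m - C^*_m\|_F + \|A_m - A^*_m\|_F \to 0$ as $m \to +\infty$. 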

The barrier $\delta^2\, \mathrm{tr}(A^{-1})$ is fairly natural and can be seen as a preconditioning of the
problem leading to favorable computations. The proposed barrier method is similar in
spirit to the approach developed in [3], and indeed Theorem 3.3 and the next Corollary 3.4
generalize the two main results in [3] to any convex penalty $F$ on the cone
of PSD matrices. However, notice that since we are considering a much wider family
of penalties (than the trace norm as in [3]), our results cannot be directly derived from
those in [3]. In the next section we discuss how to compute the solution of Problem
$(S^\delta)$ with a block coordinate approach.

Algorithm 1 Convex Multi-task Learning

Input: $K$, $Y$, $\epsilon$ tolerance, $\delta$ perturbation parameter, $\mathcal{S}^\delta$ objective functional of $(S^\delta)$,
       $V$ loss, $F$ structure penalty.

Initialize: $(C, A) = (C_0, A_0)$, $t = 0$

repeat

    $C_{t+1} \leftarrow$ SupervisedStep($V$, $K$, $Y$, $C_t$, $A_t$)

    $A_{t+1} \leftarrow$ UnsupervisedStep($F$, $K$, $\delta$, $C_{t+1}$, $A_t$)
    $t \leftarrow t + 1$

until $|\mathcal{S}^\delta(C_{t+1}, A_{t+1}) - \mathcal{S}^\delta(C_t, A_t)| < \epsilon$


3.2 Block Coordinate Descent Methods 

The characteristic block variable structure of the objective function in problem $(S^\delta)$
suggests that it might be beneficial to use block coordinate methods (BCM) (see [?])
to solve it. Here with BCM we identify a large class of methods that, in our setting,
iterate steps of an optimization on $C$, with $A$ fixed, followed by an optimization of $A$,
for $C$ fixed.

A meta block coordinate algorithm to solve $(S^\delta)$ is reported in Algorithm 1. Here
we interpret each optimization step over $C$ as a supervised step, and each optimization
step over $A$ as an unsupervised step (in the sense that it involves the inputs but not the
outputs). Indeed, when the structure matrix $A$ is fixed, problem (R) boils down to the
standard supervised multi-task learning frameworks where a priori knowledge regarding
the task structure is available. Instead, when the coefficient matrix $C$ is fixed, the
problem of learning $A$ can be interpreted as an unsupervised setting in which the goal
is to actually find the underlying task structure [23]. 
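
The meta-algorithm described above can be sketched as a short generic loop. The function and argument names are illustrative assumptions, the two block updates are passed in as callables, and a toy jointly convex objective with scalar blocks stands in for the multi-task problem so that the exact block minimizers are available in closed form.

```python
def alternating_minimization(objective, supervised_step, unsupervised_step,
                             C0, A0, tol=1e-10, max_iter=1000):
    """Alternate block updates until the objective changes by less than tol."""
    C, A = C0, A0
    history = [objective(C, A)]
    for _ in range(max_iter):
        C = supervised_step(C, A)      # update C with A held fixed
        A = unsupervised_step(C, A)    # update A with the new C held fixed
        history.append(objective(C, A))
        if abs(history[-2] - history[-1]) < tol:
            break
    return C, A, history

# Toy objective: (c - 1)^2 + (a - 2)^2 + 0.5 (c - a)^2, with scalars standing in for C and A.
toy = lambda c, a: (c - 1.0) ** 2 + (a - 2.0) ** 2 + 0.5 * (c - a) ** 2
c_step = lambda c, a: (2.0 + a) / 3.0   # exact minimizer over c for fixed a
a_step = lambda c, a: (4.0 + c) / 3.0   # exact minimizer over a for fixed c
c_star, a_star, history = alternating_minimization(toy, c_step, a_step, 0.0, 0.0)
```

Because each step is an exact block minimization, the recorded objective values never increase, and on this toy problem the iterates approach the joint minimizer $(1.25, 1.75)$.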

Several optimization methods can be used as procedures for both SupervisedStep
and UnsupervisedStep in Algorithm 1. In particular, a first class of methods is
called Block Coordinate Descent (BCD) and identifies a wide class of iterative
methods that perform (typically inexact) minimization of the objective function one block
of variables at a time. Different strategies to choose which direction to minimize at each
step have been proposed: pre-fixed cyclic order, greedy search [?] or random selection
according to a predetermined distribution [?]. For a review of several BCD algorithms
we refer the reader to [?] and references therein.

A second class of methods is called alternating minimization and corresponds to the
situation where at each step in Algorithm 1 an exact minimization is performed. This
latter approach is favorable when a closed form solution exists for at least one block
of variables (see Section 3.2.1) and has been studied extensively in [?] in the abstract
setting where an oracle provides a block-wise minimizer at each iteration. The
following Corollary describes the convergence properties of BCD and alternating minimization
sequences provided by applying Algorithm 1 to $(S^\delta)$. 

Corollary 3.4. Let the Problem $(S^\delta)$ be defined as in Theorem 3.3. Then:

(a) Alternating Minimization: Let the two procedures in Algorithm 1 each provide
a block-wise minimizer of the functional with the other block held fixed. Then
every limiting point of a minimization sequence provided by Algorithm 1 is a
global minimizer for $(S^\delta)$.

(b) Block Coordinate Descent: Let the two procedures in Algorithm 1 each consist
of a single step of a first order optimization method (e.g. Projected Gradient
Descent, Proximal methods, etc.). Then every limiting point of a minimizing
sequence provided by Algorithm 1 is a global minimizer for $(S^\delta)$.

Corollary 3.4 follows by applying previous results on BCD and alternating
minimization. In particular, for the proof of part (a) we refer to Theorem 4.1 in [?], while
for part (b) we refer to Theorem 2 in [?]. 

In the following we discuss the actual implementation of both the SupervisedStep and
UnsupervisedStep procedures in the case where $V$ is chosen to be the least-squares loss and the
penalty $F$ to be a spectral p-Schatten norm. This should provide the reader with a
practical example of how the meta-algorithm introduced in this section can be specialized
to a specific multi-task learning setting. 

Remark 3.5. (Convergence of Block Coordinate Methods) Several works in
multi-task learning have proposed some form of BCM strategy to solve the learning problem.
However, to our knowledge, so far only the authors in [3] have considered the issue
of convergence to a global optimum. Their results were proved for a specific choice
of structure penalty in a framework similar to that of problem (R) (see Section 4) but
do not extend straightforwardly to other settings. Corollary 3.4 aims to fill this gap,
providing convergence guarantees for block coordinate methods for a large class of
multi-task learning problems. 

3.2.1 Closed Form solutions for Alternating Minimization: Examples 

Here we focus on the alternating minimization case and discuss some settings in which 
it is possible to obtain a closed form solution for the procedures SupervisedStep and 
UnsupervisedStep. 

(SupervisedStep) Least Squares Loss. When the loss function $V$ is chosen to be
least squares (i.e. $V(Y, Z) = \|Y - Z\|_F^2$ for any two matrices $Y, Z \in \mathbb{R}^{n \times m}$) and the
structure matrix $A$ is fixed, a closed form solution for the coefficient matrix $C$ returned
by the SupervisedStep procedure can be easily derived (see for instance [?]):

$$\mathrm{vec}(C) = (I_T \otimes K + \lambda A^{-1} \otimes I_n)^{-1}\, \mathrm{vec}(Y).$$

Here, the symbol $\otimes$ denotes the Kronecker product, while the notation $\mathrm{vec}(M) \in \mathbb{R}^{nm}$
for a matrix $M \in \mathbb{R}^{n \times m}$ identifies the concatenation of its columns in a single vector.
In [?] the authors proposed a faster approach to solve this problem in closed form
based on Sylvester’s method. 
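
The Kronecker-product formula for the least-squares supervised step can be sketched directly in NumPy. This is a minimal illustration on random data with a hypothetical function name; it builds the full $nT \times nT$ system, which is only practical for small problems, whereas the faster Sylvester-based approach mentioned above is not reproduced here.

```python
import numpy as np

def supervised_step(K, Y, A, lam):
    """Solve (I_T kron K + lam * A^{-1} kron I_n) vec(C) = vec(Y) for C."""
    n, T = Y.shape
    A_inv = np.linalg.inv(A)
    system = np.kron(np.eye(T), K) + lam * np.kron(A_inv, np.eye(n))
    vec_c = np.linalg.solve(system, Y.flatten(order="F"))  # vec stacks the columns
    return vec_c.reshape((n, T), order="F")

rng = np.random.default_rng(0)
n, T, lam = 8, 3, 0.5
X = rng.standard_normal((n, 4))
K = X @ X.T                          # PSD kernel matrix (linear kernel)
B = rng.standard_normal((T, T))
A = B @ B.T + np.eye(T)              # positive definite structure matrix
Y = rng.standard_normal((n, T))
C = supervised_step(K, Y, A, lam)
# The linear system is the vectorized form of K C + lam * C A^{-1} = Y.
residual = K @ C + lam * C @ np.linalg.inv(A) - Y
```

A vanishing `residual` also implies that the gradient of $\|Y - KC\|_F^2 + \lambda\, \mathrm{tr}(A^{-1} C^\top K C)$ with respect to $C$, which equals $2K$ times the residual, is zero at the returned matrix.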

(UnsupervisedStep) p-Schatten penalties. We consider the case in which $F$ is
chosen to be a spectral penalty of the form $F(\cdot) = \|\cdot\|_p^p$ with $p \ge 1$. Also in this
setting the optimization problem has a closed form solution, as shown in the following.

Figure 1: Comparison of the computational performance of the alternating
minimization strategy studied in this paper with respect to the optimization methods proposed
for MTCL in [19] and MTFL in [3] in the original papers. Experiments are repeated for
different numbers of tasks and input-space dimensions as described in Sec. 5.1.

Proposition 3.6. Let the penalty of problem $(S^\delta)$ be $F = \|\cdot\|_p^p$ with $p \ge 1$. Then,
for any fixed $C \in \mathbb{R}^{n \times T}$, the optimization problem $(S^\delta)$ in the block variable $A$ has a
minimizer of the form

$$A^C_\delta = \sqrt[p+1]{(C^\top K C + \delta^2 I_T)\, \lambda / p}. \qquad (4)$$

Proposition 3.6 generalizes a similar result originally proved in [3] for the
special case $p = 1$ and provides an explicit formula for the UnsupervisedStep of
Algorithm 1. We report the proof in the supplementary material. 
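
A closed-form unsupervised step for p-Schatten penalties can be sketched with an eigendecomposition. The update used below, $A = \left(\tfrac{\lambda}{p}(C^\top K C + \delta^2 I_T)\right)^{1/(p+1)}$, is an assumption of this sketch: it is what the first-order condition $-\lambda A^{-1} M A^{-1} + p A^{p-1} = 0$ with $M = C^\top K C + \delta^2 I_T$ gives when $F(A) = \|A\|_p^p$ and $\lambda$ multiplies the trace term. Function names and data are illustrative.

```python
import numpy as np

def unsupervised_step(K, C, lam, delta, p):
    """Return ((lam / p) * (C^T K C + delta^2 I))^(1 / (p + 1)) via eigendecomposition."""
    T = C.shape[1]
    M = C.T @ K @ C + delta ** 2 * np.eye(T)   # positive definite for delta > 0
    evals, evecs = np.linalg.eigh(M)
    return (evecs * (lam / p * evals) ** (1.0 / (p + 1))) @ evecs.T

def a_objective(A, K, C, lam, delta, p):
    """Part of the perturbed objective that depends on A, with F(A) = sum_i sigma_i(A)^p."""
    T = C.shape[1]
    M = C.T @ K @ C + delta ** 2 * np.eye(T)
    sigma = np.linalg.svd(A, compute_uv=False)
    return lam * np.trace(np.linalg.solve(A, M)) + np.sum(sigma ** p)

rng = np.random.default_rng(0)
n, T, lam, delta, p = 8, 3, 0.5, 1e-2, 2
X = rng.standard_normal((n, 4))
K = X @ X.T
C = rng.standard_normal((n, T))
A_star = unsupervised_step(K, C, lam, delta, p)
best = a_objective(A_star, K, C, lam, delta, p)

# Compare against small symmetric perturbations that keep the matrix positive definite.
perturbed = []
for _ in range(20):
    E = rng.standard_normal((T, T))
    E = 1e-3 * (E + E.T)
    perturbed.append(a_objective(A_star + E, K, C, lam, delta, p))
```

None of the perturbed matrices achieves a lower value than `best`, which is consistent with `A_star` minimizing the convex function of $A$ obtained by fixing $C$.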


4 Previous Work: Comparison and Discussion 

The framework introduced in problem (Q) is quite general and accounts for several
choices of loss function and task-structural priors. Section 3 has been mainly devoted
to deriving efficient and generic optimization procedures; in this section we focus our
analysis on the modeling aspects, investigating the impact of different structure
penalties on the multi-task learning problem. In particular, we will briefly review some
previously proposed multi-task learning methods, discussing how they can be
formulated as special cases of problem (Q) (or, equivalently, (R)). 

Spectral Penalties. The penalty $F = \|\cdot\|_F^2$ was considered in [14], together with
a least squares loss function, and the non-convex problem (Q) is solved directly by
alternating minimization. However, as pointed out in Sec. 3, solving the non-convex
problem directly (although it is invex, see the discussion on Corollary 3.2) could in
principle become problematic when the alternating minimization sequence gets close to the
boundary of $\mathbb{R}^{n \times T} \times S^T_{+}$. A related idea is that of considering $F(A) = \mathrm{tr}(A)$ (i.e. the
1-Schatten norm). This latter approach can be shown to be equivalent to the Multi-Task
Feature Learning setting of [3] (see supplementary material). 

Cluster Tasks Learning. In [19], the authors studied a multi-task setting where tasks
are assumed to be organized in a fixed number $r$ of unknown disjoint clusters. While
the original formulation was conceived for the linear setting, it can be easily extended to
non-linear kernels and cast in our framework. Let $E \in \{0, 1\}^{T \times r}$ be the binary matrix
whose entry $E_{st}$ has value 1 or 0 depending on whether task $s$ is in cluster $t$ or not.
Set $M = I - E^\dagger E^\top$, and $U = \frac{1}{T} 1 1^\top$. In [19] the authors considered a regularization
setting of the form of (R) where the structure matrix $A$ is parametrized by the matrix
$M$ in order to reflect the cluster structure of the tasks. More precisely:

$$A^\dagger(M) = \epsilon_M U + \epsilon_B (M - U) + \epsilon_W (I - M)$$

where the first term characterizes a global penalty on the average of all task predictors,
the second term penalizes the between-clusters variance, and the third term controls
the task variance within each cluster. Clearly, it would be ideal to identify an optimal
matrix $A(M)$ minimizing problem (Q). However, $M$ belongs to a discrete non-convex
set, therefore the authors propose a convex relaxation by constraining $M$ to be in a convex
set $\mathcal{S}_c = \{ M \in S^T_{+},\ 0 \preceq M \preceq I,\ \mathrm{tr}(M) = r \}$. In our notation, $F(A)$ is therefore
the indicator function over the set of all matrices $A = A(M)$ such that $M \in \mathcal{S}_c$. The
authors propose a pseudo gradient descent method to solve the problem jointly. 

Convex Multi-task Relation Learning. Starting from a multi-task Gaussian Process
setting, in [?] the authors propose a model where the covariance among the coefficient
vectors of the $T$ individual tasks is controlled by a matrix $A \in S^T_{++}$ in the form of
a prior. The initial maximum likelihood estimation problem is relaxed to a convex
optimization with target functional of the form

$$\|Y - KC\|_F^2 + \lambda_1\, \mathrm{tr}(C^\top K C) + \lambda_2\, \mathrm{tr}(A^{-1} C^\top K C) \qquad (5)$$

constrained to the set $\mathcal{A} = \{ A \ | \ A \in S^T_{++},\ \mathrm{tr}(A) = 1 \}$. This setting is equivalent to
problem (R) (by choosing $F$ to be the indicator function of $\mathcal{A}$) with the addition of the
term $\mathrm{tr}(C^\top K C)$. 

Non-Convex Penalties. Often, interesting structural assumptions cannot be
cast in a convex form, and indeed several works have proposed non-convex penalties
to recover interpretable relations among multiple tasks. For instance [?] requires $A$
to be a graph Laplacian, while [?] imposes a low-rank factorization of $A$ in two smaller
matrices. In [?] different sparsity models are proposed.

Interestingly, most of these methods can be naturally cast in the form of problem
(Q) or (R). Unfortunately our analysis of the barrier method does not necessarily hold
in these settings, and therefore alternating minimization is not guaranteed to lead to a
stationary point. 


5 Experiments 

We empirically evaluated the efficacy of the block coordinate optimization strategy
proposed in this paper on both artificial and real datasets. Synthetic experiments were
performed to assess the computational aspects of the approach, while realistic settings
were used to evaluate the quality of the solutions found by the system.

MTFL   0.2333 ± 0.0213   0.0416   0.1658 ± 0.0107   0.0379   0.1428 ± 0.0083   0.0281   0.1311 ± 0.0055   0.0003
MTRL   0.2314 ± 0.0217   0.0404   0.1653 ± 0.0112   0.0401   0.1421 ± 0.0081   0.0288   0.1303 ± 0.0058   0.0071
OKL    0.2284 ± 0.0232   0.0630   0.1604 ± 0.0123   0.0641   0.1410 ± 0.0087   0.0350   0.1301 ± 0.0073   0.0087

Table 1: Comparison of multi-task learning methods on the Sarcos dataset. The
advantage of learning the tasks jointly decreases as more training examples become available. 


5.1 Computational Times 

As discussed in Sec. 4, several methods previously proposed in the literature, such as
Multi-task Cluster Learning (MTCL) [19] and Multi-task Feature Learning (MTFL) [3],
can be formulated as special cases of problem (Q) or (R). It is natural to compare the
proposed alternating minimization strategy with the optimization solution originally
proposed for each method. To assess the system’s performance with respect to varying
dimensions of the feature space and an increasing number of tasks, we chose to
perform this comparison in an artificial setting. 

We considered a linear setting where the input data lie in $\mathbb{R}^d$ and are distributed
according to a normal distribution with zero mean and identity covariance matrix. $T$
linear models $w_t \in \mathbb{R}^d$ for $t = 1, \dots, T$ were then generated according to a normal
distribution in order to sample $T$ distinct training sets, each comprising 30 examples
$(x_i^{(t)}, y_i^{(t)})$ such that $y_i^{(t)} = \langle w_t, x_i^{(t)} \rangle + \epsilon$, with $\epsilon$ Gaussian noise with zero mean and
standard deviation 0.1. On these learning problems we compared the computational
performance of our alternating minimization strategy with that of the optimization
algorithms originally proposed for MTCL and MTFL, for which the code has been
made available by the authors. In our algorithm we used the identity matrix $A_0 = I$ as
initialization for the alternating minimization procedure. We used a least-squares loss
for all experiments.
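
As an illustration, the synthetic setting described above can be sketched as follows. This is a minimal sketch rather than the code used for the experiments: the function name `make_synthetic_tasks`, the seed, and the choice of a standard normal distribution for the models $w_t$ (the text only says "a normal distribution") are assumptions.

```python
import numpy as np

def make_synthetic_tasks(T, d, n_per_task=30, noise_std=0.1, seed=0):
    """Sample T linear regression tasks in R^d (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Assumption: the models w_t are drawn from a standard normal distribution.
    W = rng.standard_normal((d, T))              # column t is the model w_t
    # Inputs ~ N(0, I_d), one independent training set per task.
    X = rng.standard_normal((T, n_per_task, d))
    noise = noise_std * rng.standard_normal((T, n_per_task))
    # y_i^(t) = <w_t, x_i^(t)> + eps
    Y = np.einsum("tnd,dt->tn", X, W) + noise
    return X, Y, W

# One of the configurations mentioned in the text: T = 100 tasks, d = 5.
X, Y, W = make_synthetic_tasks(T=100, d=5)
print(X.shape, Y.shape, W.shape)
```

Storing one input set per task keeps the T training sets distinct, as in the description; a shared design matrix would be a different (simpler) setting.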

Figure 1 reports the computational times required by alternating minimization and by
the original methods to converge to the same minima (of the MTCL and MTFL
functionals, respectively). We considered two settings: one where the number of tasks was
fixed to $T = 100$ and $d$ increased from 5 to 150, and a second one where $d$ was fixed to
100 and $T$ varied between 5 and 150. To account for statistical stability we repeated
the experiments for each pair $(T, d)$ and for different choices of hyperparameters,
generating a new random dataset each time. We can make two observations from
these results: 1) when $T$ is kept fixed, the computational times of both the original
MTCL and MTFL methods increase linearly, while that of alternating minimization is
almost constant with respect to the input space dimension; 2) when $d$ is fixed and the
number of tasks increases, all optimization strategies require more time to converge.
This shows that in general alternating minimization is a viable option to solve these
problems and that it is particularly efficient when $T \ll \min(d, n)$, which is often the
case in non-linear settings.
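
To make the alternating scheme concrete, the following is a minimal sketch for the linear least-squares case. It is not the implementation used in the experiments. It assumes a single input matrix `X` shared by all tasks, the trace penalty $F(A) = \mathrm{tr}(A)$, and a small barrier term controlled by `delta`; the function names, the scaling of the objective and the default values of `lam`, `delta` and `n_iter` are illustrative choices.

```python
import numpy as np

def objective(X, Y, W, A, lam, delta):
    """Assumed objective: ||XW - Y||_F^2 + lam * (tr(A^{-1}(W^T W + delta^2 I)) + tr(A))."""
    T = W.shape[1]
    reg = np.trace(np.linalg.solve(A, W.T @ W + delta**2 * np.eye(T))) + np.trace(A)
    return np.sum((X @ W - Y) ** 2) + lam * reg

def alternating_minimization(X, Y, lam=0.1, delta=1e-3, n_iter=20):
    d, T = X.shape[1], Y.shape[1]
    A = np.eye(T)                       # A_0 = I, as in the text
    XtX, XtY = X.T @ X, X.T @ Y
    history = []
    for _ in range(n_iter):
        # Supervised step: solve XtX W + lam W A^{-1} = XtY in the eigenbasis of A.
        a, U = np.linalg.eigh(A)
        rhs = XtY @ U
        W_rot = np.column_stack([
            np.linalg.solve(XtX + (lam / a[t]) * np.eye(d), rhs[:, t])
            for t in range(T)
        ])
        W = W_rot @ U.T
        # Unsupervised step: closed form A = (W^T W + delta^2 I)^{1/2}.
        s, V = np.linalg.eigh(W.T @ W + delta**2 * np.eye(T))
        A = (V * np.sqrt(s)) @ V.T
        history.append(objective(X, Y, W, A, lam, delta))
    return W, A, history

# Small illustrative problem (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
W_true = rng.standard_normal((5, 4))
Y = X @ W_true + 0.1 * rng.standard_normal((30, 4))
W, A, history = alternating_minimization(X, Y)
```

Each iteration minimizes the assumed objective exactly over one block at a time, so the values recorded in `history` are non-increasing.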
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Accuracy (%) per # tr. samples per class 

Method   50              100             150
STL      72.23 ± 0.04    76.61 ± 0.02    79.23 ± 0.01
MTFL     73.23 ± 0.08    77.24 ± 0.05    80.11 ± 0.03
MTRL     73.13 ± 0.08    77.53 ± 0.04    80.21 ± 0.05
OKL      72.25 ± 0.03    77.06 ± 0.01    80.03 ± 0.01


Table 2: Classification results on the 15-scene dataset. Three multi-task methods
(MTFL, MTRL, OKL) and the single-task baseline (STL) are compared.


5.2 Real datasets

We assessed the benefit of adopting multi-task learning approaches on two real datasets.
In particular, we considered the following algorithms: Single Task Learning (STL)
as a baseline. Multi-task Feature Learning (MTFL) lO, Multi-task Relation Learning 
(MTRL) 1321, Output Kernel Learning (OKL) lfl4l . We used least squares loss for all 
experiments. 

Sarcos. Sarcos is a regression dataset designed to evaluate machine learning
solutions for inverse dynamics problems in robotics. It consists of a collection of
21-dimensional inputs, i.e. the joint positions, velocities and accelerations of a robotic arm
with 7 degrees of freedom, and 7 outputs (the tasks), which report the corresponding
torques measured at each joint.

For each task, we randomly sampled 50, 100, 150 and 200 training examples, while we
kept a common test set of 5000 examples for all tasks. We used a linear kernel and
performed 5-fold cross-validation to find the best regularization parameter according
to the normalized mean squared error (nMSE) of the predicted torques. We averaged the
results over 10 repetitions of these experiments. The results, reported in Table 1, show
that adopting a multi-task approach in this setting is favorable. To quantify this
improvement more precisely, Table 1 also reports the normalized improvement (nI) over
single-task learning (STL). For each multi-task method MTL, the normalized
improvement nI(MTL) is computed as the average

$$\mathrm{nI}(\mathrm{MTL}) = \frac{1}{n_{\exp}} \sum_{i=1}^{n_{\exp}} \frac{\mathrm{nMSE}_i(\mathrm{STL}) - \mathrm{nMSE}_i(\mathrm{MTL})}{\sqrt{\mathrm{nMSE}_i(\mathrm{STL}) \cdot \mathrm{nMSE}_i(\mathrm{MTL})}}$$

over all the $n_{\exp} = 10$ experiments of the normalized differences between the nMSE
achieved by the STL approach and by the given multi-task method MTL.
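
The normalized improvement defined above can be computed as in the following sketch. The function name `normalized_improvement` and the input values in the usage lines are hypothetical and are not taken from the experiments.

```python
import numpy as np

def normalized_improvement(nmse_stl, nmse_mtl):
    """Average over experiments of (nMSE_STL - nMSE_MTL) / sqrt(nMSE_STL * nMSE_MTL)."""
    nmse_stl = np.asarray(nmse_stl, dtype=float)
    nmse_mtl = np.asarray(nmse_mtl, dtype=float)
    return float(np.mean((nmse_stl - nmse_mtl) / np.sqrt(nmse_stl * nmse_mtl)))

# Hypothetical per-repetition nMSE values, for illustration only.
stl = [0.25, 0.24, 0.26]
mtl = [0.23, 0.23, 0.24]
print(normalized_improvement(stl, mtl))
```

A positive value means that the multi-task method achieved a lower nMSE than STL on average; identical errors give exactly zero.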

15-Scenes. 15-Scenes is a dataset designed for scene recognition, consisting of a 15-
class classification problem. We represented images using LLC coding llTSl and trained 
the system on training sets comprising 50, 100 and 150 examples per class. The test set
consisted of 7500 images evenly divided among the 15 scenes. Table 2 reports
the mean classification accuracy over 20 repetitions of the experiments. While all
multi-task approaches achieve approximately similar performance, they consistently
outperform the STL baseline.

Sarcos data: http://www.gaussianprocess.org/gpml/data/

15-Scenes data: http://www-cvr.ai.uiuc.edu/ponce_grp/data/


6 Conclusions 

We have studied a general multi-task learning framework where the tasks structure 
can be modeled compactly in a matrix. For a wide family of models, the problem of 
jointly learning the tasks and their relations can be cast as a convex program, general¬ 
izing previous results for special cases mm. Such an optimization can be naturally 
approached by block coordinate minimization, which can be seen as alternating
between supervised and unsupervised learning steps, optimizing respectively the tasks or
their structure. We evaluated our method on real data, confirming the benefit of multi-task
learning when tasks share similar properties.

From an optimization perspective, future work will focus on studying the theoretical
properties of block coordinate methods, in particular regarding convergence rates.
Indeed, the empirical evidence we report suggests that such strategies can be remarkably
efficient in the multi-task setting. From a modeling perspective, future work will
focus on studying wider families of matrix-valued kernels, overcoming the limitations
of separable ones. This would make it possible to also account for structures in the joint
interaction space between the input and output domains, which is not the case for
separable models.
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Appendix 

Imposing Known Structure on the Tasks 

Coding and Embedding 

A common approach to encode knowledge of the tasks relations consists in mapping
the output space $\mathcal{Y}$ into a new space $\tilde{\mathcal{Y}} \subseteq \mathbb{R}^\ell$ and then solving either $\ell$ independent
standard learning problems (e.g. RLS, SVM, Boosting, etc. [11]) or a single one with a
joint loss (e.g. Ranking [21]), using the mapped outputs as training observations. The
goal is to implicitly exploit the structure of the new space to enforce known (or desired)
relations among tasks.

The most popular setting for these embedding (or coding) methods is multi-class
classification, since in several realistic learning problems classes can be organized in
informative structures such as hierarchies or trees. Interestingly, due to the symbolic
nature of the representation of the classes as the canonical basis of $\mathbb{R}^T$, nonlinear
embeddings are not particularly meaningful in classification contexts. Indeed, the
literature on coding methods for multi-task learning has been mainly concerned with the
design of linear operators $L : \mathcal{Y} \to \tilde{\mathcal{Y}}$. In the following we show that a tight connection
exists between coding methods and our multi-task learning setting.

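For a fixed linear operator $L \in \mathbb{R}^{\ell \times T}$ we can solve the “coded” problem using
the notation introduced in the main text and a kernel of the form $\Gamma = k I_\ell$, with $I_\ell$ the
$\ell \times \ell$ identity matrix (“independent tasks” kernel):

$$\underset{\tilde{C} \in \mathbb{R}^{n \times \ell}}{\text{minimize}} \quad V(\tilde{Y}, K\tilde{C}) + \lambda\, \mathrm{tr}(\tilde{C}^\top K \tilde{C}) \qquad (6)$$

where $\tilde{Y} = Y L^\top$ collects the mapped outputs.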

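From the Representer theorem we know that the solution of (6) will have the form
$\tilde{f}(x) = \sum_{i=1}^{n} k(x, x_i)\, \tilde{c}_i$ with $\tilde{c}_i = L c_i \in L(\mathbb{R}^T)$ for some $c_i \in \mathbb{R}^T$.
Therefore, we can restrict (6) to matrices $\tilde{C} = C L^\top$ with $C \in \mathbb{R}^{n \times T}$,
implying that the best solution for (6) belongs to the set of functions $\tilde{f} = L \circ g \in \mathcal{H}_{k I_\ell}$
with $g \in \mathcal{H}_{k I_T}$.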

For those loss functions C that depend only on the inner product between the vec¬ 
tors of prediction and the ground truth (e.g. logistic or hinge ETl [^ . see below), 
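the “coded” Problem (6) on $\tilde{\mathcal{Y}}$ with kernel $k I_\ell$ is equivalent to Problem (Q) on $\mathcal{Y}$ with
kernel $k L^\top L$. More precisely, if the multi-output loss can be written as
$\mathcal{L}(\tilde{y}, \tilde{f}(x)) = \mathcal{L}(\langle \tilde{y}, \tilde{f}(x) \rangle_{\tilde{\mathcal{Y}}})$ for all $\tilde{y} \in \tilde{\mathcal{Y}}$ and $x \in \mathcal{X}$, we have

$$\langle \tilde{y}, \tilde{f}(x) \rangle_{\tilde{\mathcal{Y}}} = \langle L y, L g(x) \rangle_{\tilde{\mathcal{Y}}} = \langle y, L^\top L\, g(x) \rangle_{\mathcal{Y}} \qquad (7)$$

where $y \in \mathcal{Y}$ is such that $L y = \tilde{y}$ and $L^\top$ denotes the adjoint operator of $L$ (in this
case just the transpose matrix, since $L$ is a linear operator between vector spaces over
the real field). Therefore, the two terms in the functional of (6) become

$$V(\tilde{Y}, K\tilde{C}) = V(Y L^\top, K C L^\top) = V(Y, K C L^\top L),$$

where the last equality makes use of the property in eq. (7), and

$$\mathrm{tr}(\tilde{C}^\top K \tilde{C}) = \mathrm{tr}(L C^\top K C L^\top) = \mathrm{tr}(L^\top L\, C^\top K C),$$

proving the aforementioned equivalence between Problems (6) and (Q) by choosing
$A = L^\top L$.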
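
The two algebraic identities used in this argument can be checked numerically. The sketch below uses random matrices of hypothetical sizes; the variable names are illustrative and not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, ell = 6, 4, 3                      # hypothetical sizes

L = rng.standard_normal((ell, T))        # coding operator L: R^T -> R^ell
C = rng.standard_normal((n, T))          # coefficients before coding
G = rng.standard_normal((n, n))
K = G @ G.T                              # a PSD Gram matrix

# Coded coefficients, C_tilde = C L^T.
C_tilde = C @ L.T

# tr(C_tilde^T K C_tilde) = tr(L^T L C^T K C)
lhs_trace = np.trace(C_tilde.T @ K @ C_tilde)
rhs_trace = np.trace(L.T @ L @ C.T @ K @ C)

# <L y, L g> = <y, L^T L g>, as in eq. (7)
y = rng.standard_normal(T)
g = rng.standard_normal(T)
lhs_inner = (L @ y) @ (L @ g)
rhs_inner = y @ (L.T @ L @ g)

print(np.isclose(lhs_trace, rhs_trace), np.isclose(lhs_inner, rhs_inner))
```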


Semantic Label Sharing In ifTTl the authors proposed a strategy to solve a large 
multi-class visual learning problem that exploited the semantic information provided 
by the WordNet ifT^ to enforce specihc relations among tasks. In particular, by de¬ 
signing a “semantic” distance between classes using the WordNet graph, the authors 
were able to generate a similarity matrix $L \in S_+^T$ encoding the most relevant class
relations. They used this matrix to map the original outputs (i.e. the canonical basis of
$\mathbb{R}^T$) into a new basis where Euclidean distances between output codes would reflect the
semantic ones induced by the WordNet priming. Then they applied a semi-supervised
One-Vs-All approach on the new output space.

Output Metric 

In multi-output settings, another approach to implicitly model the tasks relations
consists in changing the metric on the output space $\mathbb{R}^T$. In particular, we can define a
matrix $\Theta \in S_{++}^T$ and denote the induced inner product on $\mathbb{R}^T$ as $\langle y, y' \rangle_\Theta = \langle y, \Theta y' \rangle_{\mathbb{R}^T}$
for all $y, y' \in \mathbb{R}^T$. For loss functions $\mathcal{L}$ such as those mentioned above (e.g.
hinge, logistic, etc.) that depend only on the inner product between observations
and predictions, for a fixed $\Theta$ the new loss is defined as $\mathcal{L}_\Theta(y, f(x)) =
\mathcal{L}(\langle y, f(x) \rangle_\Theta) = \mathcal{L}(\langle y, \Theta f(x) \rangle_{\mathbb{R}^T})$ and induces a learning problem of the form

$$\underset{C \in \mathbb{R}^{n \times T}}{\text{minimize}} \quad V(Y, K C \Theta) + \lambda\, \mathrm{tr}(\Theta C^\top K C) \qquad (8)$$

which is clearly equivalent to solving Problem (Q) with kernel $k\Theta$. Notice that the
second term in eq. (8) derives from the observation that, with the new metric, the
norm in the RKHS becomes $\|f\|^2_{\mathcal{H}_{k I_T}} = \langle f, f \rangle_{\mathcal{H}_{k I_T}} = \sum_{i,j=1}^{n} k(x_i, x_j) \langle c_i, c_j \rangle_\Theta =
\mathrm{tr}(\Theta C^\top K C)$, as required.

Metric Learning. In [24] the authors proposed a metric learning framework in which
both the new metric $A$ (or $\Theta$) and the task predictors were estimated simultaneously.
Adopting almost the same notation as Problem (Q), they used the least squares loss
and imposed a penalty $F(A) = -\log(\det(A))$ on the metric/structure matrix. A further
penalty was also imposed on $A$ in order to enforce specific sparsity patterns.
The only difference with our framework is that in [24] the authors do not impose the
regularization term $\mathrm{tr}(A C^\top K C)$. Notice however that such a term allows us to apply
Theorem 3.1 and thus obtain the equivalence between (Q) and (R). This is extremely
useful from the optimization perspective since, for instance, for the least squares loss
and the log-determinant penalty mentioned above, Problem (R) is actually jointly convex,
which is not the case for the framework in [24].
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Learning the tasks and their structure 

Equivalence with the convex problem 

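We will make use of the following observation.

Lemma 6.1. Consider $K \in S_+^n$ and $C \in \mathbb{R}^{n \times T}$. Then $\mathrm{Ran}(C^\top K C) = \mathrm{Ran}(C^\top \sqrt{K}) = \mathrm{Ran}(C^\top K)$.

Proof. The second equivalence follows directly from the observation that $K =
\sqrt{K}\sqrt{K}$ and $\sqrt{K} = K (\sqrt{K})^\dagger$. Regarding the first equivalence, recall
that for any matrix $M \in \mathbb{R}^{T \times m}$ we have $\mathbb{R}^T = \mathrm{Ran}(M) \oplus \mathrm{Ker}(M^\top)$, with $\mathrm{Ker}(\cdot)$
denoting the null space. Therefore we can alternatively prove that $\mathrm{Ker}(C^\top K C) = \mathrm{Ker}(\sqrt{K} C)$.
Notice that clearly $\mathrm{Ker}(\sqrt{K} C) \subseteq \mathrm{Ker}(C^\top K C)$. Now, let $x \in \mathrm{Ker}(C^\top K C)$, so that
$0 = x^\top C^\top K C x = x^\top (\sqrt{K} C)^\top (\sqrt{K} C) x$. This implies that $x$ is a singular vector of
$\sqrt{K} C$ with singular value equal to zero and therefore $x \in \mathrm{Ker}(\sqrt{K} C)$. □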
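
As a sanity check, the statement of Lemma 6.1 can be verified numerically on a rank-deficient Gram matrix. The sketch below is illustrative only: the helper `same_range`, the sizes and the tolerances are assumptions. Two matrices with the same number of rows have the same range exactly when they have the same rank as their horizontal concatenation.

```python
import numpy as np

def same_range(M1, M2, tol=1e-8):
    """True if M1 and M2 (same number of rows) span the same column space."""
    r1 = np.linalg.matrix_rank(M1, tol)
    r2 = np.linalg.matrix_rank(M2, tol)
    r12 = np.linalg.matrix_rank(np.hstack([M1, M2]), tol)
    return bool(r1 == r2 == r12)

rng = np.random.default_rng(0)
n, T = 6, 4
G = rng.standard_normal((n, 2))
K = G @ G.T                              # PSD Gram matrix of rank 2

# Matrix square root of K; tiny or negative eigenvalues are set to zero.
w, V = np.linalg.eigh(K)
w[w < 1e-10] = 0.0
sqrtK = (V * np.sqrt(w)) @ V.T

C = rng.standard_normal((n, T))
M1 = C.T @ K @ C                         # T x T
M2 = C.T @ sqrtK                         # T x n
M3 = C.T @ K                             # T x n

print(same_range(M1, M2), same_range(M2, M3))
```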

Proof (Theorem 3.1).

We need to prove that $\mathcal{C}$ is a convex set and that $\mathrm{tr}(A^\dagger C^\top K C)$ is jointly convex
on $\mathcal{C}$. Regarding the first part, notice that for $A \in S_+^T$ and $C \in \mathbb{R}^{n \times T}$ the constraint
$\mathrm{Ran}(C^\top K C) \subseteq \mathrm{Ran}(A)$ can be equivalently rewritten as $\mathrm{Ker}(K C) \supseteq \mathrm{Ker}(A)$.
Therefore, using Lemma 6.1 we can check convexity of $\mathcal{C}$ by showing that for any
arbitrary pair $(A_1, C_1), (A_2, C_2) \in \mathcal{C}$ and any $\theta \in [0, 1]$ we have
$\mathrm{Ker}(\theta A_1 + (1 - \theta) A_2) \subseteq \mathrm{Ker}(\theta K C_1 + (1 - \theta) K C_2)$. Let us consider an arbitrary
$x \in \mathrm{Ker}(\theta A_1 + (1 - \theta) A_2)$. We have

$$0 = x^\top (\theta A_1 + (1 - \theta) A_2) x = \theta\, x^\top A_1 x + (1 - \theta)\, x^\top A_2 x.$$

Since both $A_1$ and $A_2$ are PSD, the terms $x^\top A_i x$ are necessarily non-negative for both
$i = 1, 2$. Hence, for $\theta \in (0, 1)$ (the cases $\theta \in \{0, 1\}$ being trivial), from the equation
above we have $x^\top A_i x = 0$, which is equivalent
to $x \in \mathrm{Ker}(A_1) \cap \mathrm{Ker}(A_2) \subseteq \mathrm{Ker}(K C_1) \cap \mathrm{Ker}(K C_2)$. This means that $x$ is in the
nullspace of both $K C_1$ and $K C_2$ and therefore also in the nullspace of any linear
combination of the two. In particular $x \in \mathrm{Ker}(\theta K C_1 + (1 - \theta) K C_2)$.

The proof for the convexity of tr{A'^ KC) has been already pointed out else¬ 
where (see for instance jS]). For completeness, we provide an simpler derivation of 
this result which makes use of a Schur’s complement argument and simple algebraic 
properties in line with Cl to show that the epigraph of the function is convex. Con¬ 
sider A G S'^ and C G R"^^. From simple properties of the trace we have the 
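equivalence $\mathrm{tr}(A^\dagger C^\top K C) = \mathrm{vec}(\sqrt{K} C)^\top (A^\dagger \otimes I_n)\, \mathrm{vec}(\sqrt{K} C)$, where $\otimes$ identifies
the Kronecker product and by $\mathrm{vec}(\cdot)$ we denote the vectorization operator mapping a
matrix $M \in \mathbb{R}^{n \times m}$ to the concatenation of all its columns $\mathrm{vec}(M) \in \mathbb{R}^{nm}$. Since
$\mathrm{Ran}(A) \supseteq \mathrm{Ran}(C^\top K C) = \mathrm{Ran}(C^\top \sqrt{K})$ we can apply the generalized Schur’s
complement to write the epigraph of $f(A, C) = \mathrm{tr}(A^\dagger C^\top K C)$ as

$$\begin{aligned}
\mathrm{epi}\, f &= \left\{ (t, A, C) \;\middle|\; t \ge \mathrm{tr}(A^\dagger C^\top K C) = \mathrm{vec}(\sqrt{K} C)^\top (A^\dagger \otimes I_n)\, \mathrm{vec}(\sqrt{K} C),\ (A, C) \in \mathcal{C} \right\} \\
&= \left\{ (t, A, C) \;\middle|\; \begin{pmatrix} A \otimes I_n & \mathrm{vec}(\sqrt{K} C) \\ \mathrm{vec}(\sqrt{K} C)^\top & t \end{pmatrix} \succeq 0,\ (A, C) \in \mathcal{C} \right\}
\end{aligned}$$

where we write $X \succeq Y$ for any two symmetric matrices $X, Y \in S^m$ if and only if
$X - Y \in S_+^m$. Notice that the block components of the matrix in the equation above
are all linear with respect to $A$, $C$ and $t$ and therefore the convexity of $\mathrm{epi}\, f$ follows
by directly observing that for any pair $(t_1, A_1, C_1), (t_2, A_2, C_2) \in \mathrm{epi}\, f$, the PSD
constraint holds for any convex combination of the two.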

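We finally prove the mapping between minimizers stated in Theorem 3.1.

First notice that for any $(C, A) \in \mathbb{R}^{n \times T} \times S_+^T$ we have $Q(C, A) = R(CA, A)$, with
$(CA, A) \in \mathrm{dom}\, R$ since clearly $\mathrm{Ran}(A) \supseteq \mathrm{Ran}(A C^\top K C A)$. Therefore
$\inf\{Q(C, A) \mid C \in \mathbb{R}^{n \times T}, A \in S_+^T\} \ge \inf\{R(C, A) \mid (C, A) \in \mathcal{C}\}$. Analogously, given a
point $(C, A) \in \mathcal{C}$ we have that $R(C, A) = R(C A^\dagger A, A)$, since $\mathrm{Ran}(C^\top K) \subseteq \mathrm{Ran}(A)$ and thus
$V(Y, K C A^\dagger A) = V(Y, K C)$. Therefore $R(C, A) = R(C A^\dagger A, A) = Q(C A^\dagger, A)$,
implying that $\inf\{R(C, A) \mid (C, A) \in \mathcal{C}\} \ge \inf\{Q(C, A) \mid C \in \mathbb{R}^{n \times T}, A \in S_+^T\}$
and concluding the proof. □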
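
The mapping between the two functionals can be illustrated numerically. The sketch below assumes a least-squares data term and omits the penalty $F(A)$, which is identical in both functionals; the explicit forms of `Q` and `R` are inferred from the proof above and should be read as assumptions rather than as the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, lam = 8, 4, 0.5                    # hypothetical sizes and regularization

G = rng.standard_normal((n, n))
K = G @ G.T                              # PSD Gram matrix
Y = rng.standard_normal((n, T))
C = rng.standard_normal((n, T))
H = rng.standard_normal((T, 2))
A = H @ H.T                              # PSD structure matrix of rank 2

def Q(C, A):
    # Assumed form: V(Y, K C A) + lam * tr(A C^T K C), least-squares V.
    return np.sum((Y - K @ C @ A) ** 2) + lam * np.trace(A @ C.T @ K @ C)

def R(C, A):
    # Assumed form: V(Y, K C) + lam * tr(A^+ C^T K C), least-squares V.
    A_pinv = np.linalg.pinv(A, 1e-10)
    return np.sum((Y - K @ C) ** 2) + lam * np.trace(A_pinv @ C.T @ K @ C)

# Forward map: Q(C, A) = R(CA, A).
forward_ok = np.isclose(Q(C, A), R(C @ A, A))

# Backward map on a point of the domain of R: R(C', A) = Q(C' A^+, A).
C_prime = C @ A                          # satisfies Ran(C'^T K) within Ran(A)
backward_ok = np.isclose(R(C_prime, A), Q(C_prime @ np.linalg.pinv(A, 1e-10), A))

print(forward_ok, backward_ok)
```

A singular $A$ is used on purpose, since the pseudoinverse and the range constraint only matter in that case.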


A Barrier Method to Optimize (1^ 

Proof (Theorem 3.3). To prove the existence of finite minimizers we need to show that
there exists a minimizing sequence for $S^\delta$ that converges to a point in $\mathrm{dom}\, S^\delta =
\mathbb{R}^{n \times T} \times S_{++}^T$. To see this, consider a generic minimizing sequence, i.e. a sequence
$\{(C_n, A_n)\}_{n \in \mathbb{N}} \subset \mathrm{dom}\, S^\delta$ such that $S^\delta(C_n, A_n) \to \inf_{C,A} S^\delta(C, A)$. Notice that we
can separate $C_n$ as $C_n = \bar{C}_n + C_n^\perp$, with $\bar{C}_n \in \mathrm{Ran}(K)$ the range of the Gram matrix
$K$ and $C_n^\perp \in \mathrm{Ker}(K)$ its nullspace, and that therefore $S^\delta(C_n, A_n) = S^\delta(\bar{C}_n, A_n)$.

This implies that the sequence $(\bar{C}_n, A_n)$ is bounded since, if it were not, either
the coercive penalty $F$ or the term $\mathrm{tr}(A_n^{-1} \bar{C}_n^\top K \bar{C}_n)$ would go to infinity as $n$ grows. But this
is not possible since $S^\delta(\bar{C}_n, A_n) \to \inf_{C,A} S^\delta(C, A) < +\infty$. Therefore $(\bar{C}_n, A_n)$
admits a converging subsequence. Suppose without loss of generality that $(\bar{C}_n, A_n)$
converges to a point $(C_*, A_*) \in \overline{\mathrm{dom}\, S^\delta} = \mathbb{R}^{n \times T} \times S_+^T$. We want to show that
$(C_*, A_*)$ is actually in $\mathrm{dom}\, S^\delta = \mathbb{R}^{n \times T} \times S_{++}^T$, i.e. that $A_*$ is positive definite.
But this is obvious since $\delta > 0$ and therefore, if the $A_n$ were to converge to a point in
$S_+^T \setminus S_{++}^T$, we would have that $\mathrm{tr}(A_n^{-1}) \to +\infty$ and therefore $S^\delta(\bar{C}_n, A_n) \to +\infty$
as $n \to +\infty$. Finally, by the continuity of $S^\delta$, we have $S^\delta(\bar{C}_n, A_n) \to S^\delta(C_*, A_*)$,
therefore proving that $(C_*, A_*) \in \mathrm{argmin}_{C,A}\, S^\delta(C, A)$.

The second part of the proof requires the following preliminary steps:

1. $\min_{C,A} R(C, A) = \inf_{A,C} S^0(C, A)$ and they have the same infimizers.

2. $g(\delta) = \inf_{A,C} S^\delta(C, A)$ is continuous (in fact convex) with minimum in 0.

We prove the first point in Lemma 6.2, while the second observation follows from
the fact that the function $g$ is the point-wise infimum of a jointly convex function over
a convex set. This requires showing that $\delta^2\, \mathrm{tr}(A^{-1})$ is jointly convex in $(\delta, A)$, which
follows from the same reasoning used for the convexity of $\mathrm{tr}(A^{-1} C^\top K C)$ in Theorem 3.1.

Let us consider two sequences $\delta_n > 0$ and $\{(C_n, A_n)\}_{n \in \mathbb{N}} \subset \mathrm{dom}\, S^\delta = \mathbb{R}^{n \times T} \times S_{++}^T$
satisfying the hypothesis of the Theorem, i.e. $(C_n, A_n) \in \mathrm{argmin}_{C,A}\, S^{\delta_n}(C, A)$.
We will first prove the result for $C_n$ in the range of the Gram matrix $K$. Notice that
under this requirement the $(C_n, A_n)$ are bounded since, analogously to the proof
above, if they were not, either the coercive penalty $F$ or the term $\mathrm{tr}(A_n^{-1} C_n^\top K C_n)$
would go to infinity as $n$ grows. But this is not possible since $S^{\delta_n}(C_n, A_n) \to g(0) < +\infty$.

Therefore, by points 1. and 2., $g(0) = \min_{C,A} R(C, A)$ and the limit points of
$(C_n, A_n)$ are minimizers for $R$. This finally implies that there exists a sequence
$\{(C_n^*, A_n^*)\}_{n \in \mathbb{N}} \subset \mathrm{argmin}_{C,A}\, R(C, A)$ such that $\|C_n - C_n^*\|_F + \|A_n - A_n^*\|_F$ tends
to zero as $n$ goes to infinity. To see this, suppose by contradiction that it is not true
and that there exist a subsequence $\{(C_{n_k}, A_{n_k})\}_{k \in \mathbb{N}}$ and an $M > 0$ such that
$\|C_{n_k} - C^*\|_F + \|A_{n_k} - A^*\|_F > M$ for all $k > 0$ and for all $(C^*, A^*) \in \mathrm{argmin}_{C,A}\, R(C, A)$.
Now, since $(C_{n_k}, A_{n_k})$ is a subsequence of $(C_n, A_n)$, we have that: (i) it
is bounded (hence admits a converging subsequence) and (ii) every converging
subsequence tends to a minimizer of $R$. This clearly contradicts the hypothesis.

Now, consider the general case in which $C_n$ is not in the range of $K$: notice that,
similarly as before, $C_n$ can be separated as $C_n = \bar{C}_n + C_n^\perp$, with $\bar{C}_n \in \mathrm{Ran}(K)$ the
range of $K$ and $C_n^\perp \in \mathrm{Ker}(K)$ its nullspace. Clearly, $S^{\delta_n}(C_n, A_n) = S^{\delta_n}(\bar{C}_n, A_n) \to
g(0)$ and therefore, from the discussion above, we have a sequence $\{(\bar{C}_n^*, A_n^*)\}_{n \in \mathbb{N}} \subset
\mathrm{argmin}_{C,A}\, R(C, A)$ such that $\|\bar{C}_n - \bar{C}_n^*\|_F + \|A_n - A_n^*\|_F \to 0$ as $n \to +\infty$. We
can now observe that the sequence $(C_n^*, A_n^*) = (\bar{C}_n^* + C_n^\perp, A_n^*)$ satisfies the statement
of the Theorem; indeed (i) the $(C_n^*, A_n^*)$ are minimizers for $R$ since $R(C_n^*, A_n^*) =
R(\bar{C}_n^*, A_n^*)$ and (ii) $\|C_n - C_n^*\|_F = \|\bar{C}_n - \bar{C}_n^*\|_F \to 0$ for $n \to +\infty$. □

Lemma 6.2. $\min_{A,C} R(C, A) = \inf_{A,C} S^0(C, A)$ and they have the same infimizers.

Proof. This fact follows from the observation that for all $\delta > 0$, $\mathrm{dom}\, S^\delta = \mathrm{dom}\, S^0$
is equal to the interior of $\mathrm{dom}\, R$ and that all minimizers for $R$ belong to $\mathrm{dom}\, R$. To
show this second statement we will prove that for any sequence $\{(C_n, A_n)\}_{n \in \mathbb{N}} \subset
\mathrm{dom}\, R$ converging to some point $(C, A) \in (\mathbb{R}^{n \times T} \times S_+^T) \setminus \mathrm{dom}\, R$, we have that
$R(C_n, A_n) \to +\infty$ as $n$ goes to infinity. For simplicity of notation let us denote
$B = C^\top K C$ and analogously $B_n = C_n^\top K C_n$. Since from hypothesis $\mathrm{Ran}(A) \not\supseteq
\mathrm{Ran}(C^\top K C)$ we have that $\mathrm{Ker}(A) \not\subseteq \mathrm{Ker}(B)$ or, in other words, there exists an
eigenvector $v$ of $A$ such that $v \in \mathrm{Ker}(A)$ and $\|B v\|_2 > 0$.

Since the sequence $A_n$ converges to $A$, we can identify a sequence of eigenvectors
$v_n$ of $A_n$ such that $v_n \to v$ and their associated eigenvalues $\lambda_n \to 0$ as $n$ goes to
infinity. Notice that we can assume without loss of generality that $\lambda_n > 0$ for all $n$,
since $\lambda_n = 0$ would imply $v_n \in \mathrm{Ker}(A_n) \subseteq \mathrm{Ker}(B_n)$, but we have from hypothesis
that $\|B_n v_n\|_2 \to \|B v\|_2 > 0$. Moreover, since $B$ is PSD and $B v \neq 0$, we have
$v_n^\top B_n v_n \to v^\top B v > 0$. Therefore we have

$$\mathrm{tr}(A_n^\dagger B_n) \ge \lambda_n^{-1}\, v_n^\top B_n v_n \to +\infty$$

as $n$ goes to infinity. □
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Spectral Regularization 

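Proposition 3.6 follows directly from the following result.

Proposition 6.3. Let $A, M \in S_+^n$ with $\mathrm{Ran}(A) \supseteq \mathrm{Ran}(M)$ and $\mathrm{rank}(M) = r$. Let $M =
U \Sigma U^\top$ be an eigendecomposition of $M$ with $U \in O^n$ and $\Sigma \in S_+^n$ a diagonal matrix
with eigenvalues in decreasing order. Then, there exists a matrix $A_* = U \Gamma U^\top \in S_+^n$,
with $\Gamma \in S_+^n$ diagonal and $\Gamma_{ii} = 0$ for all $i > r$, such that

$$\mathrm{tr}(A_*^\dagger M) = \mathrm{tr}(A^\dagger M) \quad \text{and} \quad \|A_*\|_p \le \|A\|_p \quad \forall p \ge 1 \qquad (9)$$

with the equality holding if and only if $A_* = A$.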

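Proof. To keep the notation uncluttered we prove the result for $\Theta = A^\dagger$. Consider an
eigendecomposition $\Theta = S \Lambda S^\top$ with $S \in O^n$ and $\Lambda \in S_+^n$ diagonal with eigenvalues
in decreasing order. Let us define $R = U^\top S \in O^n$. Then

$$\mathrm{tr}(\Theta M) = \mathrm{tr}(R \Lambda R^\top \Sigma) = \sum_{i=1}^{r} \sum_{j=1}^{n} \sigma_i R_{ij}^2 \lambda_j = \sum_{i=1}^{r} \sigma_i \gamma_i$$

where $\sigma_i$ and $\lambda_i$ are respectively the $i$-th eigenvalues of $M$ and $\Theta$ and we have defined
$\gamma_i = \sum_{j=1}^{n} R_{ij}^2 \lambda_j$ for $i \le r$ and $\gamma_i = 0$ otherwise. Hence, if we consider a diagonal
matrix $\Gamma \in S_+^n$ such that $\Gamma_{ii} = \gamma_i$ and set $\Theta' = U \Gamma U^\top$ we obtain the left equivalence
of eq. (9), namely $\mathrm{tr}(\Theta M) = \mathrm{tr}(\Theta' M)$. Now, consider the $p$-Schatten norm of $(\Theta')^\dagger$

$$\|(\Theta')^\dagger\|_p = \Big( \sum_{i=1}^{r} \gamma_i^{-p} \Big)^{1/p} = \Big( \sum_{i=1}^{r} \Big( \sum_{j=1}^{n} R_{ij}^2 \lambda_j \Big)^{-p} \Big)^{1/p}.$$

Notice that $R_{ij} = u_i \cdot s_j$ corresponds to the projection of the $i$-th eigenvector of $M$ on
the $j$-th eigenvector of $\Theta$. Since $\mathrm{Ran}(\Theta) = \mathrm{Ran}(A) \supseteq \mathrm{Ran}(M)$, for any eigenvector
$s \in \mathbb{R}^n$ in the nullspace of $\Theta$ (i.e. with associated eigenvalue $\lambda = 0$), we have that
$u_i \cdot s = 0$ for all $i \le r$. Hence, for all $i \le r$, $1 = R_i \cdot R_i = \sum_{j=1}^{n} R_{ij}^2 = \sum_{j=1}^{k} R_{ij}^2$,
where $R_i$ denotes the $i$-th row of $R$ and $k = \mathrm{rank}(A)$. Therefore, since the $R_{ij}^2$ add up to 1
and the scalar function $(1/x)^p$ is convex in $x \in \mathbb{R}_{++}$, we have

$$\sum_{i=1}^{r} \Big( \sum_{j=1}^{k} R_{ij}^2 \lambda_j \Big)^{-p} \le \sum_{i=1}^{r} \sum_{j=1}^{k} R_{ij}^2 \lambda_j^{-p} \le \sum_{j=1}^{k} \lambda_j^{-p} \sum_{i=1}^{n} R_{ij}^2 = \sum_{j=1}^{k} \lambda_j^{-p},$$

where we have made use of the fact that for all $j = 1, \dots, n$ we have $\sum_{i=1}^{n} R_{ij}^2 =
R^j \cdot R^j = 1$, with $R^j$ the $j$-th column of $R$. Therefore, $\|(\Theta')^\dagger\|_p \le \|\Theta^\dagger\|_p$. By taking
$A_* = (\Theta')^\dagger$ we have the desired result. □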
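
The construction used in the proof can be checked numerically. The sketch below is an illustration under simplifying assumptions: $A$ is taken positive definite (so the range condition holds trivially), the sizes are arbitrary, and the helper `schatten` is a hypothetical name.

```python
import numpy as np

def schatten(X, p):
    """p-Schatten norm: the l_p norm of the singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
n, r = 5, 3

# A positive definite, so that Ran(A) contains Ran(M) trivially.
G = rng.standard_normal((n, n))
A = G @ G.T + 0.1 * np.eye(n)
# M positive semidefinite of rank r.
H = rng.standard_normal((n, r))
M = H @ H.T

# Eigendecomposition of M with eigenvalues in decreasing order.
sigma, U = np.linalg.eigh(M)
order = np.argsort(sigma)[::-1]
sigma, U = sigma[order], U[:, order]

Theta = np.linalg.inv(A)
# gamma_i = sum_j R_ij^2 lambda_j = (U^T Theta U)_ii for i <= r.
gamma = np.diag(U.T @ Theta @ U)[:r]

Ur = U[:, :r]
Theta_prime = Ur @ np.diag(gamma) @ Ur.T       # Theta' = U Gamma U^T
A_star = Ur @ np.diag(1.0 / gamma) @ Ur.T      # A_* = (Theta')^+

trace_equal = np.isclose(np.trace(Theta_prime @ M), np.trace(Theta @ M))
norms_ok = all(schatten(A_star, p) <= schatten(A, p) + 1e-9 for p in (1, 2, 3))
print(trace_equal, norms_ok)
```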
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Applied to the minimization in problem with C G fixed and p-Schatten 

penalty, Proposition l6.3l states that a minimizer Ac G S'J has the same system of eigen¬ 
values as KC and their spectrum have same sparsity pattern (i.e. Ran(C'^ ATC) = 
Ran(A)). This observation leads directly to the closed formula to hnd a A^, stated in 
Proposition |T 6 ] 

Proof (Proposition 3.6). Consider the eigendecomposition $C^\top K C = M = U \Sigma U^\top$
with $U \in O^T$ and $\Sigma \in S_+^T$ diagonal with the eigenvalues arranged in descending order.
We apply Proposition 6.3 and obtain the minimizer $A_* = U \Gamma U^\top$ for $\Gamma \in S_+^T$ diagonal
with the same sparsity pattern as $\Sigma$. We can rewrite the target function as

-+Xlt. 

It 

where r = rank{M). Therefore, the optimization problem consists in minimizing 
the target function above with respect to the 7 tS. This is an unconstrained convex 
optimization of a differentiable coercive function bounded below and therefore it is 
sufficient to set the gradient to zero and solve with respect to the 74 . It is clear that for 
each f = 1 ... r, the minimizer is of the form 74 = cr* / A, leading to the desired 
solution. □ 



Linear Multi-task Learning 

Several works in multi-task learning have focused on linear models where the
multi-output predictor $f : \mathbb{R}^d \to \mathbb{R}^T$ is parameterized by a matrix $W \in \mathbb{R}^{d \times T}$ whose
columns $w_t \in \mathbb{R}^d$ are associated to the individual task-predictors $f_t(x) = \langle w_t, x \rangle_{\mathbb{R}^d}$
for any $x \in \mathbb{R}^d$. In this setting, tasks structure can be imposed by considering a suitable
matrix penalty $\Omega : \mathbb{R}^{d \times T} \to \mathbb{R}$ and regularization schemes of the form

$$\min_{W \in \mathbb{R}^{d \times T}} \; V(Y, XW) + \Omega(W) \qquad (10)$$

where $X \in \mathbb{R}^{n \times d}$ is the matrix whose rows correspond to the (transposed) input points
in the training sets, ordered according to the order in $Y$. We can recognize two main
classes of penalty functions. A first class corresponds to methods that impose structured
sparsity on the input features across the multiple tasks, for instance considering the
penalty n(-) = || • ||2,i ID, which encourages whole rows of W to be simultaneously 
sparse, see also ll20l 1^ . A second class corresponds to spectral regularization methods 
dehned by penalties 17 acting on the singular values of W. Examples in this class 
include methods that impose low-rank assumptions 0 on the tasks, or search after 
tasks-cluster structures HD. Ideas related to a combination of the above methods can 
also be considered Qol. 

Most Linear multi-task learning problems of the form (fTOl i with 17 spectral penalty, 
can be formulated in terms of problem (1^ for a suitable choice of F. Indeed it can be 

^Again V would weight with zeros the loss associated to entries for which examples are not available 
during training 
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shown that for several spectral norms, such as the p-schatten norms, the penalty fl can 
be written as 


$$\Omega(W) = \inf_{A \in S_{++}^T} \mathrm{tr}(W A^{-1} W^\top) + F_\Omega(A) \qquad \forall\, W \in \mathbb{R}^{d \times T}.$$


Here we report the example of the nuclear norm || • ||*, that has already been observed 
in similar form in 13 (TS) and that can be easily derived from Prop. 13.61 for the case 


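$p = 1$:

$$\|W\|_* = \frac{1}{2} \inf_{A \in S_{++}^T} \left[ \mathrm{tr}(W A^{-1} W^\top) + \mathrm{tr}(A) \right].$$

Indeed, from Prop. 3.6 we have that the solution to the minimization problem is
$A_* = \sqrt{W^\top W}$ and therefore the infimum of the bracketed functional is
$2\, \mathrm{tr}(\sqrt{W^\top W})$, so that the right-hand side is exactly $\mathrm{tr}(\sqrt{W^\top W}) = \|W\|_*$.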
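
The variational form of the nuclear norm can be verified numerically, as in the sketch below. The matrix sizes are hypothetical, and $W$ is drawn at random so that $W^\top W$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 7, 4                               # hypothetical sizes
W = rng.standard_normal((d, T))

def objective(A):
    """0.5 * (tr(W A^{-1} W^T) + tr(A)) for a positive definite A."""
    return 0.5 * (np.trace(W @ np.linalg.inv(A) @ W.T) + np.trace(A))

# Candidate minimizer A = (W^T W)^{1/2}, computed by eigendecomposition.
s, V = np.linalg.eigh(W.T @ W)
A_opt = (V * np.sqrt(s)) @ V.T

nuc = np.linalg.norm(W, "nuc")            # sum of the singular values of W
value_at_opt = objective(A_opt)
value_perturbed = objective(A_opt + 0.1 * np.eye(T))

print(np.isclose(value_at_opt, nuc), value_perturbed >= nuc)
```

The perturbed matrix is only a spot check that moving away from $\sqrt{W^\top W}$ does not decrease the objective; it does not replace the argument based on Prop. 3.6.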


Impose Tasks Relationships by enforcing structure on 
the feature space 

Relations among tasks can also be modeled by enforcing shared structures on the input
space. For instance in [3], the authors generalized a feature selection framework to the
multi-task setting by formulating the linear problem

$$\underset{U \in O^d,\ M \in \mathbb{R}^{d \times T}}{\text{minimize}} \quad V(Y, XUM) + \gamma \|M\|_{2,1}^2 \qquad (11)$$

where $X \in \mathbb{R}^{n \times d}$ is the matrix whose $i$-th row corresponds to the input vector $x_i \in \mathbb{R}^d$
and the $(2,1)$-norm $\|M\|_{2,1} = \sum_{i=1}^{d} \big( \sum_{t=1}^{T} M_{it}^2 \big)^{1/2}$ is introduced to enforce sparsity among
the rows of $M$. This penalty generalizes feature selection to the multi-task case
by directly manipulating the covariance on the input space. However, since input and 
output distributions are connected by the training data, it is reasonable to expect this 
process to indirectly affect also the covariance on the output space. Indeed, in this 
Section we present an interesting result connecting multi-task problems that impose 
structure on the input covariance and problems that instead aim to control the output 
covariance (i.e. in the form of d^l. 

To show this connection, we need to discuss in more detail the work in H- Al¬ 
though (fTTI) is not convex, the authors prove that there exists an equivalent convex 
formulation of the form 

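$$\underset{\mathrm{Ran}(D) \supseteq \mathrm{Ran}(W),\ \mathrm{tr}(D) \le 1}{\text{minimize}} \quad V(Y, XW) + \gamma\, \mathrm{tr}(W^\top D^\dagger W). \qquad (12)$$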

The authors then proceed to generalize this framework to the nonlinear case using the
advantages of the RKHS notation. In this setting, the original idea of identifying a
low dimensional set of directions in the feature space translates naturally to the
problem of finding a small set of orthogonal directions in the Hilbert space. To this end,
the authors perform a preprocessing step whose goal is to identify an orthonormal
basis of functions $\psi_1, \dots, \psi_\ell \in \mathcal{H}_k$ for the set spanned by the $k(x_i, \cdot)$ and define a matrix
$\tilde{K} \in \mathbb{R}^{n \times \ell}$ such that $\tilde{K}_{ij} = \psi_j(x_i)$. A possible way to do this is by considering an
eigenvalue decomposition of $K$ and taking $\tilde{K}$ to be the resulting square-root factor, so
that $\tilde{K} \tilde{K}^\top = K$ (removing the columns equal to zero). It is easy to show that the
standard learning problem in RKHS settings can be cast equivalently in this new notation.
However, this framework has the further advantage that it can be generalized to take
into account a possible transformation in the feature space, leading to the extension of
problem (12) to the nonlinear case

$$\underset{\mathrm{Ran}(D) \supseteq \mathrm{Ran}(B),\ \mathrm{tr}(D) \le 1}{\text{minimize}} \quad V(Y, \tilde{K} B) + \gamma\, \mathrm{tr}(B^\top D^\dagger B) \qquad (13)$$

As can be noticed, the structure of problem (fT3l l is very similar to the one of prob¬ 
lem and indeed, as stated in Corollary 16. 5l the two are equivalent when trace reg¬ 
ularization is imposed on (1^ . However, as shown in Theorem 16.41 a more general 
equivalence holds. 
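
The preprocessing step that builds the matrix of basis-function evaluations from the Gram matrix can be sketched as follows. This is one possible construction, assumed here for illustration: the helper name `feature_factor` and the relative tolerance used to discard null directions are not taken from the original work.

```python
import numpy as np

def feature_factor(K, tol=1e-10):
    """Return K_tilde with K_tilde @ K_tilde.T == K, dropping null directions."""
    s, U = np.linalg.eigh(K)
    keep = s > tol * s.max()              # discard (numerically) zero eigenvalues
    # Column j is sqrt(s_j) * u_j, i.e. the values psi_j(x_i) of an orthonormal
    # basis of span{k(x_i, .)} evaluated at the training points.
    return U[:, keep] * np.sqrt(s[keep])

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 3))
K = G @ G.T                               # rank-3 Gram matrix on 6 points
K_tilde = feature_factor(K)

print(K_tilde.shape, np.allclose(K_tilde @ K_tilde.T, K))
```

The number of retained columns equals the rank of the Gram matrix, which plays the role of $\ell$ in the text.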

Theorem 6.4. Let X > 0, p > 1, R"^^, {xi, C x a set of input-output 

pairs with y G R"^^ the matrix whose i-th row corresponds to pi. Letppi ,..., G 
T-Lk be an orthonormal basis for span{k(xi, - and K C R"^^ with Kij = ipj^Xi). 
Then 

minimize S{B,D) = V{Y,kB)+tr{B* B) + X\\D\\p (T) 

Ran(Z)) DRan(S) 

is a convex optimization problem equivalent to with penalty function F(A) = 
II A||p. In particular the two problems achieve the same minimum and, given a mini¬ 
mize r for one problem it is possible to obtain a solution for the other and vice-versa. 

The crucial aspect of the proof of Theorem lh.TI lwhich we prove below) consists in 
identifying the two mappings that allow to obtain a minimizer for problem ¥R\ from a 
solution of O and vice-versa. 

As a corollary of Theorem (16.4b we get the exact equivalence to the problem proposed 
in lE]. 

Corollary 6.5. Problem ( 113b is equivalent to (l7~b for p — In particular the two 
problems achieve the same minimum for A = 7 ^/ 4 . Ai a consequence of Theorem \6.4\ 
this implies also that (O is also equivalent to (1^ when J^(-) = II • II 1 = trf). 

This result follows from the direct comparison of the minimizers for the prob¬ 
lems (13 (from Proposition 13.6b and ( fT3l l (from |I3)- Notice, that although equiva¬ 
lent as convex optimizations, it is in general more convenient to solve problems in the 
form (|3 rather than (|3 since in most cases T << £. 

Proof Theorem \6.4\ 

From the discussion in lE) we can rewrite problem in the equivalent formula¬ 
tion 

minimize T{B, A) = V(Y, KB) + trik^B^B) + X \\A\\p (U) 

Ran(A)DRan(i3^) 
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Therefore, to prove Theorem 16.41 it is sufficient to show that problem (|^ and (El are 
equivalent. Assume without loss of generality T < i. Consider an arbitrary matrix B G 

^ singular value decomposition B = V ^ where 0 G 

identifies a matrix of all zeros, V G 0^,U G and E G S'^ a diagonal matrix with 
eigenvalues in descending order. From Pror)ositon l6.3l we obtain that the minimizers 
of the two functions S{B, •) and T{B, •) are unique and can be written respectively in 
the forms 


^ e 5^ and Ab = UTaU^ G Sl 

where FdjFa G SJ, have same sparsity pattern as E and the zero matrices in the 
formulation of Db are of appropriate dimension. We can therefore write the minimum 
value achieved by S{B,-) as S{B, Db) = V{Y,KB) + fr(r^E^) + A||r£i||p and the 
minimum achieved by T{B, •) as T{B, Ab) = V{Y, KB) + tr(rl^Y'^) + AHFaUp. In 
the light of these equations, it can be easily cheked that by setting = UTbU~^ G 

S’^ we have 

S{B, Db) = T{B, A^^) > T{B, Ab) 

where the inequality follows from the fact that Ab is a minimizer for T{B,-). Anal¬ 
ogously, we can design a matrix D^^^ G such that T{B,Ab) = S{B,D^^^) > 
S{B, Db)- Since the minimizers As and Db are unique, it follows that F/j = F^. In 
the perspective of this result, we have that for any minimizer (i?*, £)*) G x 

for (|3, the couple (i?,, G x 5^ is a minimizer for and further¬ 

more, the two functions achieve the same minimum value. The same result holds in the 
opposite direction. □ 
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